Executive Summary


Developing  situation  awareness  is  vital  for  almost  any  kind  of  military  operation.  Through  understanding 
the  state  and  nature  of  the  environment,  military  personnel  can  plan  and  respond  accordingly.  Situation 
awareness  is  often  treated  as  the  problem  of  knowing  where  all  the  potential  targets  are.  Through  knowing 
the  locations  of  these  targets,  threats  can  be  identified  and  countered.  To  meet  these  needs,  Wide  Area 
Surveillance  (WAS)  systems  have  been  developed  which  are  able  to  sense  large  swaths  of  an  environment 
simultaneously and at high resolution. However, the next key challenge is to automatically analyse this
image  data  to,  for  example,  track  the  locations  of  targets  and  identify  potential  anomalous  behaviour. 

This  report  begins  to  explore  how  the  output  from  a  WAS  can  be  used  by  a  state-of-the-art  multi-target 
tracking system. In particular, we considered how the output of the image processing and matching algorithms
used in the Likelihood of Features Tracker (LoFT) could be combined with a Probability Hypothesis
Density (PHD) filter. Using machine learning techniques, we developed a formalism and algorithms to automatically
predict how the visual appearance of a vehicle can change over time. Using this prediction model,
we are then able to automatically threshold and detect potential candidate vehicle locations, and assess both
the probability of detection and the probability of clutter.

To test the performance of this approach, the machine-learned feature prediction model was combined
with a PHD filter and applied to several WAS reference datasets. The results were quantified in terms of
track  duration  and  integrity,  and  substantial  performance  benefits  were  obtained.  We  also  discuss  potential 
future  developments  of  these  algorithms. 
Chapter 1


Introduction 


1.1  Motivation 

1.1.1  Situation  Awareness 

Developing  situation  awareness  is  vital  for  almost  any  kind  of  military  operation  [2] .  Through  understanding 
the  state  and  nature  of  the  environment,  military  personnel  can  plan  and  respond  accordingly.  Situation 
awareness  is  often  treated  as  the  problem  of  knowing  where  all  the  potential  targets  are.  Through  knowing 
the locations of these targets, threats can be identified and countered. Another important source of awareness
is understanding where targets are not. Regions which are free of targets can be used to constrain
where targets might be [5,12]. Furthermore, regions without targets can be of direct tactical importance in
their  own  right.  For  example,  they  can  be  used  to  plan  egress  routes. 

Given  these  operational  considerations,  Wide  Area  Surveillance  (WAS)  offers  an  important  solution. 
Through the use of sensors with a wide field of view, large swathes of the environment can be monitored
simultaneously. This makes it possible to identify individual targets and groups of targets which can constitute
potential  risks. 

An  important  approach  for  conducting  WAS  is  to  use  Wide  Area  Motion  Imagery  (WAMI)  [7]. 

1.1.2  Wide  Area  Motion  Imagery  to  Conduct  Wide  Area  Surveillance 

In WAMI, an urban environment is monitored by an airborne camera array that forms a high spatial resolution,
low frame rate (one to ten frames per second) imaging system. Using such high resolution images, large
numbers  of  targets  can  be  detected  and  tracked  simultaneously.  Figure  1.1  shows  an  example  of  the  imagery 
which  can  be  collected  by  WAMI  systems. 

The  single  frame  provides  a  detailed,  high  resolution  view  of  a  large  part  of  the  environment.  Targets  such 
as  individual  vehicles  are  visible  for  many  frames.  However,  tracking  in  such  images  is  extremely  challenging 
for  many  reasons.  These  include  the  relatively  small  size  of  targets,  the  large  number  of  targets,  and  changes 
in  appearance  due  to  changes  in  environmental  conditions  and  the  relative  attitude  between  the  target  and 
the  camera. 

To  meet  these  challenges,  the  problem  has  undergone  intense  research  and  development  and  a  number 
of different systems have been developed. For the work carried out here, we are using the Likelihood of
Features  Tracker  (LoFT)  [7-9].  LoFT  is  an  appearance-based  tracking  system.  Initialised  by  an  operator, 
LoFT attempts to track a moving vehicle through a sequence of images. The most significant difficulty in
this process is consistently identifying and tracking the same target through subsequent frames. To
improve robustness, a range of image-based features are used. These include gradient orientation information
from histograms of oriented gradients, gradient magnitude, intensity maps, median binary patterns and shape
indices based on eigenvalues of the Hessian matrix.
Figure 1.1: A frame from the Four Hills dataset. This single frame includes images of a large number of cars
and buildings. This dataset is used extensively in our preliminary investigation.




Distribution  A:  Approved  for  public  release;  distribution  is  unlimited. 


(a)  Frame  46905.  (b)  Frame  47022. 


Figure  1.2:  Two  frames  from  a  WAMI  sequence.  Note  that  the  different  vantage  points  of  the  camera  mean 
that  different  parts  of  the  street  are  visible  at  different  times.  From  [7]. 


Car 1   Car 2   Car 3   Car 4   Car 5


Figure  1.3:  Some  of  the  challenges  associated  with  tracking  vehicles  even  when  they  lie  within  the  field-of-view 
of  the  camera.  From  [7]. 


Although LoFT provides extremely important capabilities in tracking targets, it has a number of
limitations:

1. Operators initiate the tracks. This supports a concept of operation in which a target vehicle,
believed to be of interest, is to be followed. However, this does not support the notion of a sensing
system which will automatically create situation awareness, particularly in large, complicated environments
with many targets.

2.  Track  loss  can  still  occur.  Track  loss  can  occur  for  a  variety  of  reasons.  These  are  principally 
caused  by  unmodelled  changes  in  appearance.  However,  they  are  also  caused  by 

3.  The  system  only  works  with  positive  returns.  The  tracking  system  works  with  explicit  detections 
of  targets.  As  such,  it  cannot  exploit  information  about  lack  of  detections  -  including  the  effects  of 
occlusion. 

To investigate these issues, we are beginning to explore how LoFT’s sophisticated image processing
algorithms can be combined with state-of-the-art multi-target tracking algorithms. In particular, we use
machine learning techniques to model the behaviour of the multi-stage detectors used in LoFT. This model
is then used in a Probability Hypothesis Density (PHD) filter. Unlike most multi-target tracking algorithms,
which create and propagate a discrete set of tracks, a PHD filter propagates a target intensity. When integrated
over a region of state space, the intensity provides an estimate of the average number of targets which can be
found in that region. As such, it can support complicated multimodal distributions and arbitrary detection models,
including regions where no observations can be made at all.

The PHD filter describes its observation process through the use of a “pseudo-likelihood”. In particular, terms
for the clutter, the measurement likelihood and the probability of detection must be specified. Because of the
complexity of image processing algorithms, simple, closed form solutions for the terms in these equations
cannot be derived. Therefore, we decided to use machine learning techniques which could model and predict
the behaviour of the detectors in LoFT. Because it is not possible to exhaustively collect data in all possible
operating conditions, we use a function approximation approach known as a Gaussian Process (GP). GPs are a
method of function approximation in which the estimated function value also includes an explicit estimate of
the accuracy of the approximation.
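
As a rough illustration of this property, the sketch below implements textbook GP regression for a scalar input in plain Python. The squared-exponential kernel, the noise level, the training data and all function names are assumptions made for this example; they are not taken from the model trained in this work. The point is only that the prediction returns a variance alongside the mean, and that the variance grows away from the training data.

```python
import math

def rbf(a, b, length=1.0, variance=1.0):
    """Squared-exponential kernel between two scalar inputs (assumed kernel)."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(xs, ys, x_star, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at the query point x_star."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, ys))
    k_star = [rbf(a, x_star) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_star, x_star) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)

# Hypothetical training data: three samples of an unknown scalar function.
mean_near, var_near = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 1.0)
mean_far, var_far = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 10.0)
```

Close to a training point the variance is small; far from all training points the prediction falls back to the prior mean with a variance close to the prior variance, which signals that the approximation should not be trusted there.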

1.2  Structure  of  the  Report 

The  structure  of  the  report  is  as  follows.  The  next  chapter  introduces  the  background  on  Wide  Area  Motion 
Imagery  (WAMI)  and  describes  LoFT  in  greater  detail.  Multi-target  tracking  using  a  PHD  filter  is  described 
in  Chapter  3.  Chapter  4  introduces  the  adaptive  framework  for  observation  modelling  that  we  use.  It 
introduces the GP, describes how it was trained, and presents some preliminary results based on a simple
track likelihood experiment. The full integration of the GP filter and the PHD filter is work in progress.
Chapter  5  describes  the  implementation.  The  summary  is  presented  in  Chapter  6. 
Chapter 2


Problem  Statement 


2.1  Introduction 

This  chapter  introduces  the  motivation  behind  the  research  and  describes  the  Likelihood  of  Features  Tracker 
(LoFT),  the  tracking  system  which  we  use  as  a  base  to  develop  our  work.  The  structure  of  this  chapter  is  as 
follows.  Section  2.2  introduces  Wide  Area  Motion  Imagery  (WAMI)  and  outlines  its  importance.  Section  ?? 
presents  a  mathematical  model  of  it.  Section  2.3  introduces  LoFT  and  discusses  its  strengths  and  weaknesses. 
We  conclude  by  identifying  potential  areas  of  contribution  by  this  work. 


2.2  Wide  Area  Surveillance 

2.2.1  Motivation 

Situation  awareness  is  critical  for  many  military  operations.  One  way  to  achieve  this  is  actively  by  pointing 
high  resolution  sensors  at  targets  or  areas  of  interest.  However, 

Wide Area Surveillance (WAS): the environment is continuously monitored by a sensing system.
One source of data is wide-area large format (WALF) video, which is airborne imagery characterized by
large spatial coverage, high resolution of about 25 cm GSD (Ground Sampling Distance) and a low frame rate
of a few frames per second. Wide-area large format imagery is also known by several other terms including
wide-area aerial surveillance (WAAS), wide-area persistent surveillance (WAPS), Large Volume Streaming
Data (LVSD) and wide-area motion imagery (WAMI) [1, 4, 6, 7].

Tracking in such imagery is challenging for several reasons. The objects of interest cover only 100 square pixels
and show seemingly large changes in motion due to the low frame rate. The oblique viewing angles of the camera
result in occlusions from tall structures. In addition, the images contain noise, which could be the result of
inaccuracies in the flight path or of atmospheric conditions. Wide-area video can help determine normal as well
as anomalous traffic patterns, especially in complex urban environments where persistent tracking of an object
is challenging due to the scene content alone. Due to their wide field of view and high resolution, these images
contain large amounts of scene content. Such content needs to be analyzed for events of interest from a safety
and security standpoint using an automatic or semi-automatic process.

2.2.2  The  Four  Hills  Dataset 

One example of a dataset which is available is the Four Hills dataset. Four images from this dataset are
shown in Figure 2.1.
Figure 2.1: Sample images from the Four Hills dataset.

Figure 2.2: The flowchart of LoFT.


2.3  LoFT 

2.3.1  Overview 

To address these challenges, the LoFT (Likelihood of Features Tracking) system was developed. The flowchart
of LoFT is shown in Figure 2.2. By clicking on a frame, an operator manually selects an object, which is then
initiated within the track management system. The pixels immediately around the clicked pixel
are interrogated, and a visual signature, composed of several templates, is constructed. A variety of
region-based, edge-based, local shape-based, and texture-based classifiers are used to improve the robustness
of the tracker. The features we use in this project are listed in Figure 2.4. The LoFT system has been
developed over a long period, and incorporates extensive work on feature detectors [8], motion
models [13] and descriptor and template adaptation [9].


2.3.2  State  Model 


The state space of LoFT is defined in 2D pixel coordinates and consists of the position and velocity of the
target together with the orientation of the template,

$$ x_k = \begin{bmatrix} r_k & c_k & \dot{r}_k & \dot{c}_k & \theta_k \end{bmatrix}^{\top}, \tag{2.1} $$

where $r_k$ and $c_k$ are the row and column of the target in the image, $\dot{r}_k$ and $\dot{c}_k$ are the row and
column velocities, and $\theta_k$ is the orientation of the stored template.

The  position  is  assumed  to  evolve  using  a  piecewise  constant  velocity  model  in  pixel  space.  The  orientation 
is  assumed  to  remain  constant.  Therefore,  the  process  model  is 


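$$
\begin{aligned}
r_{k+1} &= r_k + \Delta T_k \, \dot{r}_k && (2.2) \\
c_{k+1} &= c_k + \Delta T_k \, \dot{c}_k && (2.3) \\
\dot{r}_{k+1} &= \dot{r}_k && (2.4) \\
\dot{c}_{k+1} &= \dot{c}_k && (2.5) \\
\theta_{k+1} &= \theta_k. && (2.6)
\end{aligned}
$$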
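
The process model can be sketched in a few lines of Python. This is an illustration only: the function name, the tuple layout (row, column, row velocity, column velocity, template orientation) and the numerical values are assumptions made for the example, not part of LoFT.

```python
def predict_state(state, dt):
    """Propagate (r, c, r_dot, c_dot, theta) over one frame interval dt."""
    r, c, r_dot, c_dot, theta = state
    # Position integrates the velocity; velocity and orientation are held constant.
    return (r + dt * r_dot, c + dt * c_dot, r_dot, c_dot, theta)

# Hypothetical values: a target at row 10, column 20 with velocity (2, -1) pixels per second.
next_state = predict_state((10.0, 20.0, 2.0, -1.0, 0.5), dt=0.5)
```

Because the frame interval can vary between frames, it is passed in as an argument rather than fixed.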

This model can adequately describe the motion of a 2D template. However, it does not directly account
for the motion of the camera or occlusion effects in the environment. To account for this, each image is
registered with respect to the first frame. Figure 2.3 shows a sequence of registered images. Although this
superficially appears to produce a highly stable image, there are issues with

Feature               Name       Dimension
Intensity Histogram   hist_I     10
Gradient Histogram    hist_M     10
ARST Histogram        hist_A1    10
SI Histogram          hist_A2    10
NC Histogram          hist_NH    10
HoG Histogram         hist_HOG   10
Intensity Block       corr_I     2500
Gradient Block        corr_M     2500
Total                 -          5060

Table 2.1: The features used and their dimensions.


2.3.3  Observation  Models 

LoFT performs feature fusion by comparing a target appearance model within a search region using feature
likelihood maps, which estimate the likelihood that each pixel within the search window belongs to the
target [9]. Because LoFT is initialised manually, it does not use detectors to identify the potential presence
of vehicles. Rather, given a first frame, the system constructs a visual signature which can be used to explain
how appearance evolves over time. Specifically, suppose a target is present in the image with state $x_k$. The entire
vehicle is assumed to be bounded within the rectangular region $\mathbf{r}_k = (r_k, c_k, w_k, h_k)$, where $r_k$ and $c_k$ are
taken from the target state, and $w_k$ and $h_k$ are the width and height of the region.

Given $\mathbf{r}_k$, the target’s appearance at time $k$ is described by a signature, which consists of a set of features,
$\mathbf{f}_k \in \mathcal{F}$, where each element of $\mathbf{f}_k$ is a measurement of some characteristic of the pixels bounded by $\mathbf{r}_k$.
Figure 2.4 illustrates some of the features used. These include mean colour intensities and Histograms of
Oriented Gradients (HoG). Table 2.1 summarises the dimensions associated with the features. Although
some of these are relatively low-dimensional, the correlation blocks are very large and the overall dimension
of the feature vector is 5060.
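
The total can be checked directly from the per-feature dimensions. The sketch below also records where each feature would sit in a concatenated vector; the concatenation order and the helper name are assumptions made for illustration, and the feature names are as read from Table 2.1.

```python
# Feature names and dimensions as read from Table 2.1.
FEATURE_DIMS = [
    ("hist_I", 10), ("hist_M", 10), ("hist_A1", 10), ("hist_A2", 10),
    ("hist_NH", 10), ("hist_HOG", 10), ("corr_I", 2500), ("corr_M", 2500),
]

def feature_slices(dims):
    """Map each feature name to the slice it occupies in the concatenated vector."""
    slices, start = {}, 0
    for name, size in dims:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start

SLICES, TOTAL_DIM = feature_slices(FEATURE_DIMS)
```

The six histograms contribute 60 dimensions in total, so the two correlation blocks account for 5000 of the 5060 dimensions.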

In many situations, a fixed template is not sufficient to ensure robust tracking. However, deciding
when to update the template is subject to the stability-plasticity dilemma: if updates are too frequent, the
template can capture subtle tracking errors, occlusions and changes in lighting; if they are too infrequent,
tracks will be lost. LoFT handles this as follows: at each frame, it attempts to estimate the
current rotation of the template. If this differs from $\theta_k$ by more than a fixed threshold, a new template is
computed from the image and $\theta_k$ is replaced by the new template angle.
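
A minimal sketch of this update rule is given below. The function names, the threshold value and the wrapping of the angle difference to a half-open interval of width 2π are assumptions made for the example; the text does not state how LoFT compares angles or which threshold it uses.

```python
import math

def angle_difference(a, b):
    """Signed difference a - b, wrapped to the interval [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def maybe_update_template(theta_stored, theta_estimated, threshold):
    """Return the stored angle to keep and whether the template must be recomputed."""
    if abs(angle_difference(theta_estimated, theta_stored)) > threshold:
        # Large rotation: recompute the template and adopt the new angle.
        return theta_estimated, True
    # Small rotation: keep the existing template and angle.
    return theta_stored, False

# Hypothetical threshold of 0.2 rad.
kept = maybe_update_template(0.0, 0.05, threshold=0.2)
updated = maybe_update_template(0.0, 0.5, threshold=0.2)
```

A larger threshold favours stability (fewer template changes), while a smaller one favours plasticity.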

In the preliminary work undertaken here, we do not use template adaptation. Rather, the features are
fixed in the first frame, and changes are learned in subsequent frames. We will investigate the use of
template adaptation later.
Figure 2.3: A sequence of registered frames in the Four Hills dataset. These are registered with respect to
feature descriptors associated with the ground plane.

Figure 2.4: LoFT features.


Chapter  3 


Multi-target  Tracking 


3.1  Introduction 

The challenges posed by WAMI can be viewed as an instance of multi-target tracking. In this
chapter, we first lay out the formulation of multi-target tracking using random finite sets. We then describe
the PHD filter formulation.

3.2  A  Finite  Set  Statistic  Approach  to  Multi  Object  Tracking 

Mahler  argued  that  the  correct  way  to  consider  the  problem  of  multi-object  tracking  is  to  use  random  sets. 
A random set is a generalisation of a vector-valued random variable to the case where the number of random
variables  is  not  known.  It  can  be  used  to  represent  both  the  distribution  of  objects  in  the  environment,  and 
the  observations  received  from  a  sensor. 

Suppose at time $k$ there are $N(k)$ targets, each one taking a value in a state space $\mathcal{X}$. The state of the
environment, $X_k$, can be written as the set

$$ X_k = \{x_{k,1}, \ldots, x_{k,N(k)}\} \subset \mathcal{X}. \tag{3.1} $$

This is a random set: both the cardinality $N(k)$ and the state of each target are unknown and must be
estimated.¹

The evolution of the state is described by the following equation,

$$ X_{k+1|k} = T(X_k) \cup B(X_k) \cup B, \tag{3.2} $$

where $T(X_k)$ describes the evolution of the persistent targets, $B(X_k)$ is the set of spawned targets and $B$
is the set of additional targets generated independently of the existing targets.

The idea is that a target survives with a probability $p_S(x_{k,i})$. If it survives, then it evolves using the
standard process model.

The environment is observed by a camera affixed to an airborne platform. We assume, for simplicity, that
the pose of the camera is measured by an extremely accurate external sensing system. As a result, we assume
that the pose of the camera is perfectly known and is given by the state vector $\tilde{x}_k$.

As explained in Subsection 2.3.3, each frame from the camera is processed using LoFT’s detection system.
This yields a set of detections together with specified pixel coordinates. Suppose that $M(k)$ detections are
acquired. These are collected into the observation set

$$ Z_k = \{z_{k,1}, \ldots, z_{k,M(k)}\} \subset \mathcal{Z}. \tag{3.3} $$

This, too, is treated as a random set.

¹ An important property of the set is that the order of the elements does not matter. Therefore, $X_k$ is equivalent to the set
$\{x_{k,N(k)}, \ldots, x_{k,1}\}$. In consequence, there is no strict ordering between the location of an element in the set and the target ID.

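The measurement model which underlies this set of detections is given as follows. The measurement set
is given by

$$ Z_k = \Upsilon(X_k, \tilde{x}_k, e) \cup C(\tilde{x}_k, e), \tag{3.4} $$

where $\Upsilon(X_k, \tilde{x}_k, e)$ is the target detection set, $C(\tilde{x}_k, e)$ is the set of false detections, $\tilde{x}_k$ is the
camera state and $e$ is the environment. Suppose the environment consists
of $N(k)$ targets, and the state of the $i$th target is $x_{k,i}$. The target detection set is of the form

$$ \Upsilon(X_k) = \Upsilon(x_{k,1}) \cup \ldots \cup \Upsilon(x_{k,N(k)}) \tag{3.5} $$

where

$$ \Upsilon(x_{k,i}, \tilde{x}_k, e) = \emptyset^{p_D(x_{k,i})} \cap \{z_i\} \tag{3.6} $$

means that the target measurement $z_i$ is detected with a probability of $p_D(x_{k,i}) = p_D(x_{k,i}, \tilde{x}_k, e)$. Here
$\emptyset^{p}$ denotes a random set which equals the whole measurement space with probability $p$ and is empty otherwise.
The clutter process $C(\tilde{x}_k, e)$ is given by

$$ C(\tilde{x}_k, e) = C(\tilde{x}_k, e). \tag{3.7} $$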

These  equations  require  some  further  justification.  The  observation  of  an  actual  target  depends  upon  the 
relative  pose  between  the  platform  and  the  target.  The  probability  of  detection  depends,  in  general,  upon 
the  relative  configuration  of  the  target  as  well.  We  expect  that,  in  general,  detection  algorithms  are  likely 
to be more successful in some configurations than others. The environment has an impact on this as well:
in some environments a vehicle is likely to be easier to see than in others. Similarly, the clutter is
generated by elements of the environment, such as parked vehicles or rubbish bins, which can appear
like cars.²

However, although the RFS provides a very general framework for tackling multi-object tracking problems,
many of the algorithms have factorial complexity and thus have little advantage over previous approaches.
Therefore, a new representation is required.


3.3  Approximate  Multi-Object  Tracking  Through  the  Use  of  PHD 
Filters 

The fundamental reason why multi-object tracking becomes challenging is that, as the number of targets
increases, both the complexity of the representation of the environment and the computational complexity rise.
Therefore, one way to address the problem is to represent the targets in a
way whose complexity does not increase with the number of targets. The way to achieve this is through
propagating the target density.

The intuition behind this approach is illustrated in Figure 3.1. The figure illustrates a typical multi-target
tracking  example  and  shows  the  intensity. 

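More formally, the intensity is the first-order moment of the random finite set $X_k$, a statistic called the Probability
Hypothesis Density (PHD). Let $D(x|Z^k)$ denote the PHD associated with the multi-object posterior
$p(X_k|Z^k)$ at time step $k$, where $Z^k$ denotes the measurement sets received up to time $k$. The intensity has the
property that, for any region $\Omega$ of the state space,

$$ E\left[\,|X_k \cap \Omega|\,\right] = \int_{\Omega} D(x|Z^k)\,dx. \tag{3.8} $$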

It  is  important  to  note  that  the  PHD  is  not  the  same  as  a  probability  distribution.  The  easiest  way  to 
see  this  is  that  integrating  the  PHD  over  the  entire  state  space  yields  the  expected  number  of  targets. 
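
To make the distinction concrete, the sketch below models a one-dimensional intensity as a weighted sum of Gaussians and integrates it numerically. The Gaussian-sum form, the component values and the function names are assumptions chosen only for illustration. The weights do not sum to one, and the integral over a region returns the expected number of targets in that region rather than a probability.

```python
import math

def gaussian(x, mean, std):
    """Normalised one-dimensional Gaussian density."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def intensity(x, components):
    """Intensity modelled as a weighted sum of Gaussians (weight, mean, std)."""
    return sum(w * gaussian(x, m, s) for w, m, s in components)

def expected_targets(components, lo, hi, steps=20000):
    """Midpoint-rule integral of the intensity over the interval [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(intensity(lo + (i + 0.5) * dx, components) for i in range(steps)) * dx

# Hypothetical scene: two well-localised targets and one target that exists
# with weight 0.5.
components = [(1.0, 10.0, 1.0), (1.0, 30.0, 1.0), (0.5, 50.0, 2.0)]
total = expected_targets(components, 0.0, 100.0)   # approximately 2.5
left = expected_targets(components, 0.0, 20.0)     # approximately 1.0
```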

The  important  practical  advantage  of  the  use  of  the  PHD  is  that,  given  a  number  of  assumptions, 
compact  closed  form  solutions  can  be  derived  which  have  a  computational  cost  which  is  linear  in  the  number 
of  observations. 

² We do not assume that the clutter is target state dependent because it is not clear how target-dependent clutter would be
generated in this scenario.
[Figure 3.1 panels, top to bottom: observation space; intensity $D(x|Z)$ propagated by successive Predict+Update steps; target state-space.]



Figure  3.1:  Illustration  of  the  PHD  filter.  The  bottom  figure  shows  the  evolution  of  the  state  in  terms  of 
a  random  set  and  illustrates  the  time  evolution,  including  the  disappearance  of  a  target.  The  top  figure 
shows  the  pattern  of  observations,  including  the  presence  of  clutter.  The  middle  figure  shows  the  intensity 
representation.  Peaks  in  intensity  show  where  the  average  number  of  targets  is  greatest. 
3.3.1 PHD Filtering Equations

The PHD filter uses random finite set statistics to sequentially propagate the intensity function through
the Bayesian prediction and update steps. The main advantage of the PHD filter is that its formulation does
not require data association. Instead, the intensity function is updated with the random set of measurements
$Z_k$, as shown below. The prediction can be realized through the following equation:


$$ D(x_k|Z^{k-1}) = b(x_k) + \int p_S(x_{k-1})\, p(x_k|x_{k-1})\, D(x_{k-1}|Z^{k-1})\, dx_{k-1}, \tag{3.9} $$

where $b(x_k)$ denotes the intensity function of spontaneous birth of new objects, $p_S(x_{k-1})$ is the probability
that the object still exists at time step $k$ given its previous state $x_{k-1}$, and $p(x_k|x_{k-1})$ is the transition
probability density of the individual objects.

For the update model, it is assumed that the false alarms obey the following conditions. First, the
number of clutter detections is Poisson distributed with a mean of $\lambda = \lambda_{k+1}(\tilde{x}_k, e)$ false alarms, where
$\tilde{x}_k$ is the camera state and $e$ is the environment. Second, the
spatial distribution of these clutter terms is given by $c(z) = c_{k+1}(\tilde{x}_k, e)$. It is further assumed that the
predicted multitarget distribution is approximately Poisson.

Given these assumptions, the update equation can be written as

$$ D(x_k|Z^k) \approx L_{Z_k,\tilde{x}_k,e}(x_k)\, D(x_k|Z^{k-1}), \tag{3.10} $$

where $L_{Z_k,\tilde{x}_k,e}(x_k)$ is the PHD pseudo-likelihood. Its value is given by

$$ L_{Z_k,\tilde{x}_k,e}(x_k) = 1 - p_D(x_k|\tilde{x}_k,e) + p_D(x_k|\tilde{x}_k,e) \sum_{z \in Z_k} \frac{L_{z,\tilde{x}_k,e}(x_k)}{\lambda c(z|\tilde{x}_k,e) + \int p_D(x_k|\tilde{x}_k,e)\, L_{z,\tilde{x}_k,e}(x_k)\, D(x_k|Z^{k-1})\, dx_k}, \tag{3.11} $$

where $p_D(x_k|\tilde{x}_k,e)$ is the probability that the sensor, with state $\tilde{x}_k$, flying over environment $e$ is able to
detect a target whose state is $x_k$, $L_{z,\tilde{x}_k,e}(x_k)$ is the likelihood of $x_k$ given observation $z$, $\lambda$ is the average
number of clutter points per scan and $c(z|\tilde{x}_k,e)$ is the probability density of the clutter return $z$.


3.3.2  Particle  Filter-Based  Implementation 

In  general,  the  probability  distribution  can  be  hard  to  write  down.  Therefore,  in  this  work  we  use  a  Sequential 
Monte  Carlo  (SMC)  based  implementation  of  the  PHD  filter. 

The SMC implementation approximates the PHD by a weighted set of $N_k$ particles,

$$ D(x_k|Z^k) \approx \sum_{i=1}^{N_k} w^{(i)}_{k|k}\, \delta\!\left(x^{(i)}_{k|k} - x_k\right), \tag{3.12} $$

where $\delta(\cdot)$ is the vector form of a delta function and

$$ \sum_{i=1}^{N_k} w^{(i)}_{k|k} = \eta_{k|k}, \tag{3.13} $$

which is the expected number of targets.

The  SMC-PHD  filter  consists  of  the  following  steps: 

1. Predict target intensity. This consists of two steps: predicting existing particles forwards, and modelling
the spontaneous birth of targets.

   (a) Predicting existing particles forwards. The process model is applied to each particle $x^{(i)}_{k|k}$ to
   generate a predicted particle $x^{(i)}_{k+1|k}$. The weights on the particles are unchanged, and so
   $w^{(i)}_{k+1|k} = w^{(i)}_{k|k}$.

   (b) Target birth. To model $b(x_k)$, we use the approach proposed in [10]. The idea is that, around
   each new measurement, a new set of particles is created. The state of each particle is initialised
   by inverting the observable parts of the state and sampling uniformly over the parts which aren’t
   observable. A total of $N_{k,new}$ particles are created this way. Each particle receives a uniform
   weight equal to the expected number of targets which appear at each time step divided by the
   number of new particles.

   The result of these two steps is that the predicted PHD is of the form

   $$ D(x_{k+1}|Z^k) \approx \sum_{i=1}^{N_k + N_{k,new}} w^{(i)}_{k+1|k}\, \delta\!\left(x^{(i)}_{k+1|k} - x_{k+1}\right). \tag{3.14} $$


2. Compute correction term. For each measurement $z_{k+1,j}$ in $Z_{k+1}$, compute the term

   $$ \Lambda_{k+1|k}(z_{k+1,j}) = \lambda c(z_{k+1,j}|\tilde{x}_k,e) + \sum_{i=1}^{N_k+N_{k,new}} w^{(i)}_{k+1|k}\, p_D\!\left(x^{(i)}_{k+1|k}|\tilde{x}_k,e\right) L_{z_{k+1,j},\tilde{x}_k,e}\!\left(x^{(i)}_{k+1|k}\right). \tag{3.15} $$

   The important thing to note is that the correction term is a function of the clutter for each observation,
   together with the fact that the likelihood and probability of detection are computed for each particle
   separately.


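3. Update. The update corresponds to rescaling the particles by a particle form of the PHD pseudo-likelihood.
Specifically,

   $$ w^{(i)}_{k+1|k+1} = L_{Z_{k+1},\tilde{x}_k,e}\!\left(x^{(i)}_{k+1|k}\right) w^{(i)}_{k+1|k}, \tag{3.16} $$

   where $L_{Z_{k+1},\tilde{x}_k,e}(x^{(i)}_{k+1|k})$ is the particle form of the PHD pseudo-likelihood. Its value is given by

   $$ L_{Z_{k+1},\tilde{x}_k,e}\!\left(x^{(i)}_{k+1|k}\right) = 1 - p_D\!\left(x^{(i)}_{k+1|k}|\tilde{x}_k,e\right) + \sum_{j=1}^{M(k+1)} \frac{p_D\!\left(x^{(i)}_{k+1|k}|\tilde{x}_k,e\right) L_{z_{k+1,j},\tilde{x}_k,e}\!\left(x^{(i)}_{k+1|k}\right)}{\Lambda_{k+1|k}(z_{k+1,j})}. \tag{3.17} $$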


4. Resample. As with all SMC implementations, resampling is required to mitigate the effects of particle
depletion. The average number of targets is computed from

   $$ \eta_{k+1|k+1} = \sum_{i=1}^{N_k+N_{k,new}} w^{(i)}_{k+1|k+1}. \tag{3.18} $$

   Any standard resampling scheme can be used to resample the particles. Once the particles
   have been resampled, the weights are multiplied by $\eta_{k+1|k+1}$ to ensure that the average number of targets
   remains the same.
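
The four steps above can be sketched as a single function in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the state is a 2D pixel position, the motion model is a Gaussian random walk, the measurement likelihood is an isotropic Gaussian, clutter is spatially uniform, target birth is omitted, resampling is multinomial, and all names and parameter values are hypothetical.

```python
import math
import random

def gaussian_likelihood(z, x, sigma):
    """Isotropic 2D Gaussian likelihood of measurement z given particle position x."""
    d2 = (z[0] - x[0]) ** 2 + (z[1] - x[1]) ** 2
    return math.exp(-0.5 * d2 / sigma ** 2) / (2.0 * math.pi * sigma ** 2)

def smc_phd_step(particles, weights, measurements, rng, *, p_d=0.99, p_s=1.0,
                 clutter_rate=1.0, clutter_density=1e-6,
                 process_sigma=5.0, meas_sigma=3.0):
    """One predict/update/resample cycle of a particle PHD filter without birth."""
    # 1. Predict: random-walk motion; weights are scaled by the survival probability
    #    (p_s = 1.0 leaves them unchanged).
    predicted = [(r + rng.gauss(0.0, process_sigma), c + rng.gauss(0.0, process_sigma))
                 for r, c in particles]
    w_pred = [p_s * w for w in weights]

    # 2. Correction term per measurement: clutter intensity plus detected target mass.
    likelihoods = [[gaussian_likelihood(z, x, meas_sigma) for x in predicted]
                   for z in measurements]
    correction = [clutter_rate * clutter_density
                  + sum(p_d * lik * w for lik, w in zip(row, w_pred))
                  for row in likelihoods]

    # 3. Update: rescale each weight by the particle pseudo-likelihood.
    w_upd = []
    for i, w in enumerate(w_pred):
        pseudo = 1.0 - p_d + sum(p_d * likelihoods[j][i] / correction[j]
                                 for j in range(len(measurements)))
        w_upd.append(pseudo * w)

    # 4. Resample in proportion to weight, then restore the total mass eta.
    eta = sum(w_upd)
    n = len(particles)
    resampled = rng.choices(predicted, weights=w_upd, k=n)
    return resampled, [eta / n] * n, eta

# Hypothetical scene: one target near (100, 100), one detection of it and one
# clutter detection far away.
rng = random.Random(0)
particles = [(100.0 + rng.gauss(0.0, 5.0), 100.0 + rng.gauss(0.0, 5.0)) for _ in range(500)]
weights = [1.0 / 500] * 500
particles, weights, eta = smc_phd_step(particles, weights,
                                       [(101.0, 99.0), (400.0, 300.0)], rng)
```

In this example the clutter detection is too far from every particle to change the weights appreciably, so the expected number of targets `eta` stays close to one.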


3.3.3  Numerical  Example 

We consider the problem of tracking one vehicle of interest initiated by an operator at time $k = 0$. When
using LoFT alone, as discussed previously, track loss can occur, and in this example we aim to illustrate
how a standard PHD filter combined with LoFT detections offers a simple solution. Instead of returning the best
match, LoFT is modified to return a set of detections consistent with 90% of the score of the
best match (see Figure 3.3). By doing so, the problem now contains false alarms and missed detections, which
fall well within the scope of the PHD framework.
Given the position of the vehicle in pixel coordinates, a bounding box around the vehicle is extracted.
Next, a set of features is extracted from this template to initialise LoFT for the first image frame. The
PHD-PF filter described in Section ?? is initialised with a set of particles around the starting pixel location,
and the sum of the weights of the particles is set to 1 since only a single target is considered to have been detected.
With respect to the PHD filter described in Section 3.3.1 and its PF implementation in Section ??, the
following simplifications and assumptions are made:

• The state is defined in image space and consists of two coordinates, i.e., $\mathbf{x}_k$ is a 2D pixel
coordinate.

• $N_k = 2000$ particles are used at each step.

• The evolution model is kept simple and independent of the camera’s state $\mathbf{x}^*_k$. A random walk evolution
model is used, with zero-mean Gaussian noise with a standard deviation of 60 pixels.

• Since we only want to track one vehicle initiated by the operator, the birth process step is eliminated,
i.e., $b(\mathbf{x}_k) = 0$ and $N_{k,\mathrm{new}} = 0$ in (??).

• The likelihood calculation is independent of the camera position and the environment. For example, the
probability of detection is not state dependent and is chosen as $p_D(\mathbf{x}_k \mid \mathbf{x}^*_k, e) = 0.99$, and the clutter
is a uniform Poisson process with parameters $\lambda = \lambda_{k+1}(\mathbf{x}^*_k, e)$ equal to the size of the image frames
and $c(\mathbf{z}) = c_{k+1}(\mathbf{x}^*_k, e) = 5$.
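
The simplified filter set-up above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions. The constants mirror the values quoted in the list, while the function names, the initial particle spread `spread` and the implicit survival probability of 1 (weights unchanged by the prediction) are hypothetical choices that the text does not specify.

```python
import numpy as np

N_PARTICLES = 2000   # N_k in the text
STEP_STD = 60.0      # random-walk standard deviation, in pixels
P_DETECT = 0.99      # state-independent probability of detection

def initialise(start_xy, spread, rng):
    """Scatter particles around the operator-selected pixel location."""
    start_xy = np.asarray(start_xy, dtype=float)
    particles = start_xy + rng.normal(0.0, spread, size=(N_PARTICLES, 2))
    # Weights sum to 1 because a single target is assumed to be present.
    weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)
    return particles, weights

def predict(particles, weights, rng):
    """Random-walk prediction with no birth process.

    No new particles are appended, and the weights are left unchanged
    (an assumed survival probability of 1).
    """
    moved = particles + rng.normal(0.0, STEP_STD, size=particles.shape)
    return moved, weights

rng = np.random.default_rng(42)
particles, weights = initialise((512.0, 384.0), spread=5.0, rng=rng)
predicted, predicted_weights = predict(particles, weights, rng)
```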

Figure 3.3 shows a set of 14 frames in which a vehicle of interest is tracked using LoFT alone, extracting
the set of detections that are consistent with 90% of the best score. While we can manually verify that the
vehicle of interest is always tracked, the set of candidates grows and the output is difficult for an operator
to interpret.

Although the PHD implementation described here is very simple, with a minimal evolution model,
the results shown in Figure 3.1 are encouraging. In comparison with the LoFT detections, the PHD
filters out most of the candidates and, at worst, displays a multimodal distribution. In that regard, it
offers better readability for the operator.

Figure 3.2: Example of LoFT detection on 14 successive frames using a template initiated by an operator

[Figure panels: PHD targets, one panel per image frame (images 2 to 15).]

Figure 3.3: Example of LoFT detection on 14 successive frames using a template initiated by an operator




Distribution  A:  Approved  for  public  release;  distribution  is  unlimited 


Chapter  4 

Probabilistic  Model  of  the 
Observations 


4.1  Introduction 

In this chapter, we describe the approach we have taken to develop the model of the observation process.
The observation process is modelled in terms of the PHD pseudolikelihood presented in (3.11) and consists
of three terms: the probability of detection $p_D(\mathbf{x}_k \mid \mathbf{x}^*_k, e)$, the measurement likelihood
$L_{\mathbf{z},\mathbf{x}^*_k,e}(\mathbf{x}_k)$ and the clutter process $\lambda c(\mathbf{z} \mid \mathbf{x}^*_k, e)$. For this preliminary report, we focus on the
development of an algorithm to predict the future appearance of the vehicle. This prediction underpins both
the probability of detection and the clutter process. One element of our ongoing work is to refine this
prediction into the detection and clutter terms.

The structure of this chapter is as follows. Section 4.2 describes the challenges which exist in tracking
the visual appearance of the vehicle over time. Section ?? describes Gaussian processes, which are the key
theoretical tool we use to perform the prediction. Section 4.4 describes the feature prediction model and
how it was tuned. Section ?? evaluates the performance of the prediction model in terms of its ability to
compute the total track likelihood for ground-truthed tracks in a test set.


4.2  Modelling  Changes  in  Visual  Appearance 


The idea is as follows. Suppose a template was taken at a time $k_t$, where $k_t < k$. At this point, the target
was in the state $\mathbf{x}_{k_t}$ and the platform was in the state $\mathbf{x}^*_{k_t}$. The template is described by the set of features
$\mathbf{f}_{k_t}$. The goal is now to identify the target at time step $k$. From the estimated target state, a Region
of Interest (ROI) can be computed. This is decomposed into a set of $n$ rectangular blocks. Within the $i$th
block, the feature $\mathbf{f}^i_k$ is computed.

We predict how the feature will appear at time $k$, $\hat{\mathbf{f}}_k$, and compute the difference
$\delta\mathbf{f}_k = \mathbf{f}^i_k - \hat{\mathbf{f}}_k$. If $\delta\mathbf{f}_k$ is sufficiently small, a detection is generated and an observation is
inserted into the observation set (3.3). Once all the blocks have been processed, the set of observations can
then be passed to the PHD filter for the update.

The prediction equation is of the form

$$
\hat{\mathbf{f}}_k = f\left(\mathbf{f}_{k_t};\, \mathbf{x}_{k_t}, \mathbf{x}^*_{k_t}, \mathbf{x}_k, \mathbf{x}^*_k, e\right). \tag{4.1}
$$



However, this function arises implicitly through the use of the many different visual descriptors used in
LoFT. As such, it is not possible to write down an explicit expression for it. Rather, we use machine learning
techniques to learn an approximation of it. In particular, we use Gaussian Processes.

Figure 4.1: Slight variation in target appearance and background over time.


4.3  Approximating  Functions  Using  Gaussian  Processes 

Gaussian Processes (GPs) are widely used in probabilistic function approximation. Consider the function

$$
y = f[\mathbf{x}], \tag{4.2}
$$

where $\mathbf{x}$ is the input and $y$ is the (scalar) output.¹ The form of $f[\cdot]$ is not known and is to be approximated
empirically. A training set $\mathcal{D}$ of $n$ measurements has been collected, where
$\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, 2, \ldots, n\}$. The input vectors are aggregated into the design matrix $X$, and the
training outputs into the vector $\mathbf{y}$. Suppose the output of the function is to be estimated at some test value
$\mathbf{x}_*$. The approximation $y_*$ is a random variable whose probability density function is given by

$$
y_* \sim p(f_* \mid \mathbf{x}_*, X, \mathbf{y}). \tag{4.3}
$$

The reason why $y_*$ is a random variable is that it reflects the uncertainty in the implicit estimation of
the function. The training and test data enter as conditioning random variables.

A GP explicitly assumes that $y_*$ is Gaussian-distributed. Therefore, to fully specify (4.3), expressions for
the mean and covariance must be provided. The GP provides a formulation for this through the specification
of a kernel function, which specifies the second-order moment of the approximation computed at two different
test values.

Many  different  choices  for  kernel  functions  exist.  For  our  application,  we  use  the  widely-adopted  squared 
exponential  kernel. 

Figure 4.2 illustrates the action of the GP for one feature, with time as the input. This diagram illustrates
an important property of the GP: when the test value $\mathbf{x}_*$ is close to an element in the training set, the
prediction error is small. This is reflected by the smaller covariance. As $\mathbf{x}_*$ moves further from the nearest
point in the training set, the covariance gradually increases. This reflects the fact that the point at which
the function is to be approximated is far from any member of the training set, and thus there is considerable
uncertainty.
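
The behaviour described above can be reproduced with a few lines of NumPy. The sketch below is a textbook zero-mean scalar GP regression with a squared exponential kernel, not the model trained in this work. The hyperparameters (`sigma`, `length`, `noise`) and the toy sine data are arbitrary illustrative choices.

```python
import numpy as np

def sq_exp(a, b, sigma=1.0, length=1.0):
    """Squared exponential kernel between two 1D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    k_tt = sq_exp(x_train, x_train) + noise**2 * np.eye(len(x_train))
    k_st = sq_exp(x_test, x_train)
    chol = np.linalg.cholesky(k_tt)
    # Solve (K + noise^2 I) alpha = y using the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = sq_exp(x_test, x_test).diagonal() - (v**2).sum(axis=0)
    return mean, var

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
# One test point on a training input, one between inputs, one far away.
x_test = np.array([1.0, 1.5, 6.0])
mean, var = gp_predict(x_train, y_train, x_test)
```

The predictive variance is smallest at the training input, larger between training inputs, and approaches the prior variance `sigma**2` far from the data.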


¹ Although GPs can be extended to vector-valued predictions, we use the scalar formulation described here. This is both
computationally simpler and a better fit for our PCA-compressed feature space, where each feature is assumed to be
independent of all other features.

Figure 4.2: A standard example of a GP. The input data (x) and function values (y). The function
approximation rises.


4.4 Design of the Feature Prediction Model

We make the following assumptions in the development of the model:

1. The mean function can be adequately approximated by a constant.

2. The feature vector is composed of the individual features listed in Table 2.1. We assume that these can
be treated as a set of independent scalar features, each of which can be modelled as an independent
Gaussian Process.

4.4.1  Dataset 

The observation model was tuned using the Four Hills dataset. Some frames from this dataset are shown in
Figure 2.1. The dataset was used both to train and to test the models. To train the models, a subset of
vehicles was manually tracked from frame to frame. The test set consists of a different set of vehicles.

Figure 4.3 shows one example of a training set. It consists of approximately 20 frames
of a vehicle driving. Variations include changes in orientation and scale. In addition, there is clutter
from other vehicles.

4.4.2  Dimensionality  Reduction  Using  PCA 

The set of features that we use from LoFT are listed in Table 2.1. Rather than use these features directly,
we apply Principal Component Analysis (PCA). There are two reasons for this. The first is to transform
each feature space into one where individual features are uncorrelated. The second is to reduce the number
of dimensions by keeping only the dimensions with the most variance, which are likely to be the most
informative.

Two criteria were explored in the choice of PCA. The first criterion was to remove dimensions from each
feature space for as long as the variance of the reduced set remained at least 90% of that of the original set.
The results are shown in Table 4.1. This greatly reduces the number of dimensions (from 5060 to 159).
However, the block correlation features are still very high dimensional. Therefore, a second criterion was
used. In this case, the number of dimensions for each feature was not allowed to exceed 10. The results in
Table 4.1 show that this limit is only required for the correlation features, and that the approximation in
corr_M (45% of the variance retained) is much greater than that in corr_I (77% retained).

Figure 4.3: Region of Interest corresponding to target 20 in the Four Hills dataset.

                         Unlimited dimensions;      Max 10 dimensions;
                         90% variance               unlimited variance
Feature     Original     Dimension   % Variance     Dimension   % Variance
            dimension
hist_I          10            2          94%             2          94%
hist_M          10            1          99%             1          99%
hist_A1         10            1         100%             1         100%
hist_A2         10            1          99%             1          99%
hist_VH         10            4          93%             4          93%
hist_HOG        10            4          90%             4          90%
corr_I        2500           30          90%            10          77%
corr_M        2500          116          90%            10          45%
Total         5060          159           -             33           -

Table 4.1: The original number of dimensions in each feature, together with the results of compressing the
features using PCA. The first set of columns shows the minimum number of dimensions required to retain
90% of the original variance. The second set of columns shows the effect on the variance when the maximum
number of dimensions for each feature is clamped at 10.
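
The two selection criteria can be sketched as a single function. This is an illustrative NumPy sketch that computes PCA through the singular value decomposition. The function name `pca_dims`, its arguments and the random stand-in data are hypothetical and are not the LoFT features used in this report.

```python
import numpy as np

def pca_dims(features, min_variance=0.9, max_dims=None):
    """Choose the number of principal components for one feature type.

    features     : (n_samples, n_dims) array of raw feature vectors
    min_variance : fraction of total variance to retain (first criterion)
    max_dims     : optional cap on the number of components (second criterion)
    Returns (n_components, retained_variance_fraction, projected_features).
    """
    centred = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    # Smallest number of components whose cumulative variance reaches the target.
    n = min(int(np.searchsorted(ratio, min_variance)) + 1, len(s))
    if max_dims is not None:
        n = min(n, max_dims)
    return n, float(ratio[n - 1]), centred @ vt[:n].T

# Stand-in data: 200 samples of a 50-dimensional feature with equal variance
# in every dimension, so that many components are needed to reach 90%.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 50))
n90, kept90, _ = pca_dims(raw)
n10, kept10, projected = pca_dims(raw, max_dims=10)
```

For data such as this, where the variance is spread across many dimensions, the cap of 10 components retains much less than 90% of the variance. This is the same effect that Table 4.1 reports for the correlation features.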

4.4.3  Choice  of  the  Dependent  Variables 

We need to specify the dependent variables which are used in (??). Although all the variables listed could be
used, this is not preferable for two reasons. The first is that higher-dimensional solutions can be expensive
and wasteful. The second is that not all of the information is available. For example, we are currently using
models in 2D. As a result, although bundle adjustment can be used to reconstruct the sequence of values for
x (see Section 2.2.2), this information cannot be directly exploited in the 2D formulation.

We  considered  the  following  choices  for  the  dependent  variables. 

Time 

This is perhaps the simplest approach. The idea is to model the covariance as a decreasing function of the
time difference. The rationale is that, as time progresses, the change in visual appearance becomes greater.
Using a squared exponential, the covariance function is of the form

$$
k(\mathbf{s}_k, \mathbf{s}'_k) = \sigma^2 \exp\left(-\frac{(t - t')^2}{2\ell^2}\right), \tag{4.4}
$$

where $\sigma^2$ is the signal variance and $\ell$ is the length scale.

Furthermore, all the datasets exhibit a constant angular rotation as a result of how the aircraft is
moving. Therefore, time is potentially a proxy for angular change.


Template  Orientation 

Here,  the  idea  is  to  look  at  the  rotation  of  the  template  directly  in  image  coordinates.  This  factors  in  changes 
due  to  the  rotation  of  the  platform  and  the  turning  of  the  vehicle. 


Pixel Coordinates

As in the time model, but with $\mathbf{s}_k$ replaced by $(t, \mathbf{r}_k)$, so that similarity is measured over both image
position and time.


Platform Orientation


This approach extends the covariance to be a decreasing function of the change in viewing angle on the
vehicle. We compute this by extracting the camera pose information from Bundler and computing the
relative transformations between pairs of camera poses. As a result, the kernel incorporates the deterministic
change in platform orientation caused by the aircraft movement:

$$
k(\mathbf{s}_k, \mathbf{s}'_k) = \sigma^2 \exp\left(-\frac{1}{2} \sum_{i=1}^{d} \frac{(s_{k,i} - s'_{k,i})^2}{\ell_i^2}\right), \tag{4.5}
$$

where $d$ is the number of components of $\mathbf{s}_k$ and $\ell_i$ is the length scale of the $i$th component.

For straight roads, the majority of the change in orientation is caused by the movement of the
platform. Therefore, we can use the platform orientation from Bundler in place of the template orientation.
This works well when the orientation change is due to the platform only, as is the case when the vehicle is
driving along a straight road. However, it does not hold in general.

World  Coordinates 

The previous two approaches are approximations. The fundamental reason why the appearance changes is
that the vehicle moves through the environment and is viewed from different angles. Therefore, this
set of features attempts to capture all these elements together. It consists of:

• $\mathbf{x}_t$

• time

• $(\mathbf{c}_t - \mathbf{x}_t)$ in polar coordinates, with the radius in log scale.
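
One possible way to assemble these inputs and compare two of them with a squared exponential kernel is sketched below. The sketch rests on several assumptions that the text does not confirm: $\mathbf{x}_t$ is taken to be a 3D target position, $\mathbf{c}_t$ a 3D camera position, and "polar coordinates" is read as range, azimuth and elevation. The function names, length scales and example values are hypothetical.

```python
import math

def world_inputs(target_xyz, camera_xyz, t):
    """Build a GP input vector: target position, time, and the camera offset
    as (log range, azimuth, elevation). Assumes the camera is not located
    exactly at the target, so the range is strictly positive."""
    dx, dy, dz = (c - x for c, x in zip(camera_xyz, target_xyz))
    rng_m = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.atan2(dy, dx)        # wraps at +/- pi
    elevation = math.asin(dz / rng_m)
    return [*target_xyz, t, math.log(rng_m), azimuth, elevation]

def sq_exp_ard(s, s_prime, sigma, lengths):
    """Squared exponential kernel with one length scale per input component."""
    total = sum(((a - b) / ell) ** 2 for a, b, ell in zip(s, s_prime, lengths))
    return sigma**2 * math.exp(-0.5 * total)

# Camera directly above the target, then displaced sideways one second later.
a = world_inputs((100.0, 200.0, 0.0), (100.0, 200.0, 3000.0), t=0.0)
b = world_inputs((110.0, 200.0, 0.0), (600.0, 200.0, 3000.0), t=1.0)
lengths = [50.0, 50.0, 50.0, 5.0, 0.5, 1.0, 0.5]   # illustrative values only
k_same = sq_exp_ard(a, a, sigma=1.0, lengths=lengths)
k_diff = sq_exp_ard(a, b, sigma=1.0, lengths=lengths)
```

Because azimuth wraps at ±π, a practical implementation would need a periodic treatment of that component. The sketch ignores this.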

4.5  Evaluation 

To evaluate the performance of the prediction models, we explored their ability to compute the joint track
likelihood on a set of ground truth tracks. The intuition is as follows. If the GP can predict the measurement
likelihood $L_{\mathbf{z},\mathbf{x}^*,e}(\mathbf{x})$ accurately, it should produce a more accurate estimate.

We use ground-truthed trajectories from the Four Hills dataset. The joint probability is

$$
p(\mathbf{x}_{1:k}, \mathbf{z}_{1:k}) \propto p(\mathbf{x}_1) \prod_{i=1}^{k} L_{\mathbf{z}_i,\mathbf{x}^*_i,e}(\mathbf{x}_i) \prod_{i=2}^{k} p(\mathbf{x}_i \mid \mathbf{x}_{i-1}). \tag{4.6}
$$

Conditioned on the ground truth dataset $\mathbf{x}_{1:k}$, this simplifies to

$$
p(\mathbf{z}_{1:k} \mid \mathbf{x}_{1:k}) \propto \prod_{i=1}^{k} L_{\mathbf{z}_i,\mathbf{x}^*_i,e}(\mathbf{x}_i), \tag{4.7}
$$

where $L_{\mathbf{z}_i,\mathbf{x}^*_i,e}(\mathbf{x}_i)$ is the measurement likelihood term of the PHD
pseudolikelihood. As explained in Section 3.3.1, this term is a standard likelihood model which computes
the probability of observing $\mathbf{z}$, conditioned on the pose of the platform, the state of the environment, and
the fact that the target is in state $\mathbf{x}$. One way to assess the quality of this approximation is to check
whether this likelihood, when used in a maximum likelihood estimator, is able to predict the correct state.
In particular, consider a sequence of

Suppose a ground truth target $\mathbf{x}$ together with an observation $\mathbf{z}$ are known. We should find that

$$
L_{\mathbf{z},\mathbf{x}^*,e}(\mathbf{x}) > L_{\mathbf{z},\mathbf{x}^*,e}(\mathbf{x} + \delta\mathbf{x}), \tag{4.8}
$$

where $\delta\mathbf{x}$ is a perturbation on the nominal state.

Figure 4.4: The observation associated with the ground truth and with the perturbed states.
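
The perturbation test can be illustrated with a small pure-Python sketch. It assumes that each frame's feature vector is scored against independent Gaussian predictions (a mean and a variance per scalar feature, as a scalar GP would supply) and that the track log likelihood is the sum over frames, which is the logarithm of the product in (4.7). The function names and all numbers are hypothetical.

```python
import math

def frame_loglik(feature, pred_mean, pred_var):
    """Log likelihood of one feature vector under independent Gaussians."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (f - m) ** 2 / v)
        for f, m, v in zip(feature, pred_mean, pred_var)
    )

def track_loglik(features, pred_means, pred_vars):
    """Sum of per-frame log likelihoods along a track."""
    return sum(
        frame_loglik(f, m, v)
        for f, m, v in zip(features, pred_means, pred_vars)
    )

# Predicted appearance for a two-frame track with two scalar features each.
pred_means = [[0.0, 1.0], [0.5, 1.5]]
pred_vars = [[0.1, 0.1], [0.1, 0.1]]

# Features observed at the ground truth state and at a perturbed state.
truth_features = [[0.05, 0.95], [0.45, 1.55]]
perturbed_features = [[0.40, 0.60], [0.90, 1.10]]

ll_truth = track_loglik(truth_features, pred_means, pred_vars)
ll_perturbed = track_loglik(perturbed_features, pred_means, pred_vars)
# Relative log likelihood, scaled with respect to the maximum value.
relative = [ll - max(ll_truth, ll_perturbed) for ll in (ll_truth, ll_perturbed)]
```

A discriminative model gives a relative log likelihood of zero at the ground truth state and negative values at the perturbed states.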


(a) 90% of the variance is retained and the number of dimensions per feature is not limited.

(b) 90% of the variance is retained, but the maximum number of dimensions per feature cannot exceed 10.

Figure 4.5: The log likelihood model using two constraints for PCA. The first ensures that the variance is
90% of the original for each feature and does not constrain the number of dimensions. The second
additionally limits each feature to at most 10 dimensions. Each graph shows the relative log likelihood,
scaled with respect to the maximum value of the likelihood function. The breaks in the lines for the
unlimited case are caused by the likelihood becoming zero, where the logarithm is undefined.


To test this, we explored the performance on several ground-truthed tracks in the Four Hills training
data. These provide a sequence of coordinates, $\mathbf{x}$, together with observations. The perturbations
$\delta\mathbf{x}$ were drawn from windows over scale, rotation and offset. Figure 4.4 shows an example of the
template associated with the ground truth observation, together with a set of perturbations.

Figure 4.5 plots the log likelihood results. As can be seen, both choices of the state space show a very
strong return at the location of the vehicle, and very weak returns away from the vehicle. This shows that
they are discriminative. However, perhaps surprisingly, the 90% variance criterion without a dimension
limit seems to produce slightly worse and numerically unstable results. This is likely to be caused by
numerical issues and/or overfitting. Therefore, this initial investigation suggests that we will be

4.6  Clutter  Model 

In addition to modelling the likelihood that a given feature vector is generated by a target, it is also
necessary to model the likelihood that a feature was generated by background clutter. Only by comparing
these likelihoods can we estimate the probability that a given feature vector was actually generated by
a target rather than by some other irrelevant feature in the environment.

Here, similar intuitions apply to the appearance of clutter as to that of targets. In particular, we may
generally expect to see spatial and temporal correlations, since parts of the environment viewed close
together in space and time are likely to appear similar. Moreover, regions of the same type of terrain are
also likely to generate similar feature vectors. For example, one would expect different sections of road to
look similar to each other, but different from buildings, vegetation, or other vehicles. With this in mind, we
again model clutter as a Gaussian Process. However, in addition to the regression inputs described above,
we also maintain separate mean and covariance parameters for different types of terrain.² These models are
then trained by computing feature vectors for randomly selected parts of each camera frame which are
known not to be associated with a target.

² Although terrain types may be user defined, and specified using maps and other prior knowledge, here we use an
unsupervised approach, automatically discovering terrain classes by clustering on feature vector values.


Chapter 5


Implementation  and  Integration 


5.1  Introduction 

In  this  chapter,  we  outline  our  implementation  of  the  PHD  filter.  This  builds  upon  the  PHD  filter  described 
in  Chapter  3  and  the  observation  model  developed  in  Chapter  4.  However,  a  crucial  element  is  that  the  most 
effective  results  are  obtained  when  we  transform  the  tracking  system  into  3D.  This  required  reformulating 
the  tracking  system. 

The  structure  of  this  chapter  is  as  follows.  Section  5.2  provides  an  overview  of  how  the  tracking  algorithm 
works.  Section  5.3  describes  how  the  2D  tracking  problem  was  transformed  into  a  full  3D  problem. 

5.2  Overview  of  the  LoFT-PHD  Implementation 

The  combined  procedure  for  tracking  each  target  that  a  user  wishes  to  observe  proceeds  as  follows. 

1. At a time $k_0$, the operator indicates a vehicle they wish to be tracked. This target is labelled target $i$.
The position of the target is computed in pixel coordinates, and a template $t_{i,k_0}$ is extracted. A set of
features $\mathbf{f}_{i,k_0}$ is extracted from this template and used for subsequent matching. The PHD filter
itself is initialised with a set of particles. The particles are clustered around the start location, and the
sum of the particle weights is 1 because a single target has been detected.

2. Using an initial estimate of the target’s heading and velocity, a search region is constructed around
each particle’s position. This is used to specify an ROI within the next camera frame, in which to search
for the target at time $k + 1$.

3. As described in the previous chapter, the ROI is decomposed into a set of $n$ overlapping rectangular
blocks, each representing a candidate successor position of the target’s appearance in the camera’s field
of view at time $k + 1$. Using the computer vision descriptors from LoFT, we then construct a feature
vector, $\mathbf{f}^j_{k+1}$, for each block.

4. For all candidate $\mathbf{f}^j_{k+1}$ constructed above, we use our Gaussian Process observation model to compute
the likelihood that each $\mathbf{f}^j_{k+1}$ was generated by the target and the likelihood that it was generated by
background clutter. If, for any given $\mathbf{f}^j_{k+1}$, the ratio of these likelihoods passes a predefined threshold,
its corresponding position is used to construct a measurement, $\mathbf{z}_{k+1,j}$, in $Z_{k+1}$. In this way, candidate
positions that have a low likelihood of being associated with the target are eliminated, leaving only a
small number of measurements that have a high likelihood of being associated with the target position.

5. Using the measurement set, $Z_{k+1}$, constructed above, particles are updated in the usual way as
described in Section 3.3.1. In particular, the history of all previous states of each particle now represents
a hypothesised track, which the target may have followed, and can be associated with the set of
measured feature vectors used to update its state. Together, these feature vectors represent the target’s
changing appearance over time, if its position had followed a given particle. In principle, when
updating a given particle’s position in future timesteps, the history of all its associated feature vectors
should be fed into the observation model, and used to compute the likelihood that any future
measurement is associated with that particle. This is because a target’s appearance in future timesteps is
dependent on its appearance in previous timesteps. However, in the interest of computational efficiency,
it may not be possible to maintain a history of all feature vectors associated with each particle. Instead,
only the $n$ most recent features associated with each particle may be maintained, since these are likely to
have a higher correlation with a target’s future appearance, thus leading to better predictions of its
future position. Nevertheless, by maintaining different feature vectors for each particle, we no longer
need to commit to a single template to estimate a target’s current appearance. Instead, each particle
is associated with a full probability distribution over a target’s likely appearance given its previous
measurements. In this way, we can maintain multiple hypotheses about a target’s current position,
and give a more complete picture of the uncertainty surrounding its current position.

In summary, we now have a complete procedure for maintaining multiple hypotheses about the current
position of a single target, where each hypothesis corresponds to a given particle in the PHD filter
implementation, and is weighted by the likelihood that it represents the target’s true position. In addition,
multiple targets can be handled in the same way, by simply allowing a user to initiate multiple tracks over
time using the same procedure. Each target would then be associated with a different subset of particles
within the PHD filter, but in all other ways, the procedure remains unchanged.
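
The gating in step 4 can be sketched as a likelihood ratio test. The pure-Python sketch below assumes that the target and clutter models each supply an independent Gaussian prediction (mean and variance) per scalar feature. The function names, the model values and the threshold are hypothetical and are not taken from the implementation described here.

```python
import math

def log_gauss(values, means, variances):
    """Log density of independent Gaussians, summed over scalar features."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(values, means, variances)
    )

def gate_blocks(blocks, target_model, clutter_model, threshold):
    """Keep the positions of blocks whose target-to-clutter likelihood ratio
    exceeds the threshold.

    blocks        : iterable of (position, feature_vector) pairs
    target_model  : (means, variances) predicted for the target
    clutter_model : (means, variances) predicted for background clutter
    """
    log_threshold = math.log(threshold)
    measurements = []
    for position, feature in blocks:
        log_ratio = log_gauss(feature, *target_model) - log_gauss(feature, *clutter_model)
        if log_ratio > log_threshold:
            measurements.append(position)
    return measurements

blocks = [
    ((10, 12), [0.1, 0.9]),    # resembles the target
    ((40, 8), [2.0, -1.0]),    # resembles clutter
    ((11, 13), [0.0, 1.1]),    # resembles the target
]
target_model = ([0.0, 1.0], [0.05, 0.05])
clutter_model = ([2.0, -1.0], [1.0, 1.0])
measurements = gate_blocks(blocks, target_model, clutter_model, threshold=10.0)
```

Working in log space avoids underflow when many features are multiplied together. The positions that survive form the measurement set passed to the PHD update.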


5.3  Transforming  the  Scenario  into  3D 


As explained in Chapter 4, appearance changes are fundamentally driven by the motion of the vehicle and
the platform in 3D space. Therefore, rather than use the 2D tracking described in Section 2.3.2, we used
a full 3D model. This involved two steps. First, a 3D model of the camera motion and the environment was
constructed using Bundler [11] and CMVS [3]. Figure 5.1 shows the bundled model which was constructed.
Second, the cameras in the constructed model were aligned with the metadata on the location of the camera
using a Procrustes minimisation algorithm. Figure 5.2 shows the camera metadata, together with the closest
alignment of the cameras with the metadata. The covariance matrix of the errors in the alignment is given
by

$$
\begin{bmatrix}
21.1724 & 16.4079 & 0.3117 \\
16.4079 & 28.4463 & 0.4906 \\
0.3117 & 0.4906 & 0.6920
\end{bmatrix}. \tag{5.1}
$$

Given the scale of the scenario, these errors are extremely small, and therefore we believe that a very
accurate result has been obtained. This is confirmed by Figure 5.3, which overlays the detections used in
the training data on the 3D model. To project these 2D detections into 3D space, the rays were intersected
with a ground plane constructed from the

As can be seen, there is good agreement between the detection locations and the road network. However,
some misalignment can be seen. This is possibly due to slight angular errors.
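
The camera alignment step can be illustrated with a least-squares similarity Procrustes fit. The text does not state which Procrustes variant was used, so the sketch below assumes the common SVD-based solution for scale, rotation and translation. The function name and the synthetic camera positions are hypothetical.

```python
import numpy as np

def procrustes_align(source, target):
    """Least-squares similarity transform mapping source points onto target.

    source, target : (n, 3) arrays of corresponding 3D points
    Returns (scale, rotation, translation) such that
    target ~= scale * source @ rotation.T + translation.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    s0, t0 = source - mu_s, target - mu_t
    u, d, vt = np.linalg.svd(t0.T @ s0)
    # Guard against a reflection so that the result is a proper rotation.
    sign = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    fix = np.array([1.0, 1.0, sign])
    rotation = u @ np.diag(fix) @ vt
    scale = (d * fix).sum() / (s0**2).sum()
    translation = mu_t - scale * rotation @ mu_s
    return scale, rotation, translation

# Synthetic check: reconstructed camera centres related to "metadata"
# positions by a known scale, rotation about the vertical axis, and offset.
rng = np.random.default_rng(0)
model_cams = rng.normal(size=(20, 3))
angle = 0.3
true_rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
true_offset = np.array([10.0, -4.0, 300.0])
meta_cams = 2.5 * model_cams @ true_rot.T + true_offset

scale, rotation, translation = procrustes_align(model_cams, meta_cams)
residuals = meta_cams - (scale * model_cams @ rotation.T + translation)
error_cov = np.cov(residuals.T)   # 3x3 covariance of the alignment errors
```

With real metadata the residuals are non-zero, and their 3x3 covariance plays the same role as the matrix in (5.1).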

(b) The dense 3D model produced using CMVS.


Figure 5.1: The 3D model constructed using Structure-from-Motion.

Figure 5.2: The metadata (blue crosses) and the aligned camera positions (red dots). Note that the horizontal
and vertical scales are very different.

Figure 5.3: The detections from a ground truth dataset (crosses) projected onto the point cloud
(purple dots). The camera frustum is the grey trapezoid.


Chapter 6


Summary  and  Conclusions 


6.1  Summary 

This report has summarised the work to date on the project “Modelling and Characterisation of Detection
Models in WAMI for Handling Negative Information”. This project investigates how observation models
can be used in state-of-the-art multi-target tracking algorithms. In particular, a machine learning technique
known as Gaussian processes has been used to develop a model which explains how features extracted from
a template describing vehicle appearance can evolve forwards through time.


6.2  Outstanding  Work 

Although this report describes work to date, there are a number of issues which are currently in development.

6.2.1  3D  Formulation  of  the  Tracking  Problem 

As  explained  in  Chapter  5,  we  are  re-formulating  the  tracking  problem  natively  in  3D.  We  have  transformed 
the  data  into  3D,  and  we  have 

To be consistent with LoFT, we have posed the tracking problem purely in terms of pixel coordinates.
However, this means it is not possible to properly model effects such as the change in distance between the
platform and the target, and the rotation caused by the orbit of the platform. Furthermore, uneven motions
of the platform cause large, uncompensated movements in the locations of the targets, posing challenges to
the trackers.

Therefore, the first thing we will do is to reformulate the tracking problem entirely in 3D. In particular,
we will use the Bundler-derived estimate of the extrinsic properties of the camera to model the motion of
the camera through space. By doing this, we will be able to account for the relative attitude between the
vehicle and the platform. The GP will be trained using the world coordinates as described in Section 4.4.3.
The motion models will be updated to describe the trajectory in 3D.

6.2.2  Training  and  Validation 

We  currently  use  38  training  tracks  from  the  Four  Hills  dataset.  We  will  seek  to  extend  this,  by  generating 
further  training  sets  within  Four  Hills,  and  also  from  other  sets  as  well.  We  currently  have  access  to  the 
Albuquerque  set. 